Feature/prefetch2#1604

Open
maddyscientist wants to merge 124 commits into develop from feature/prefetch2

Conversation

@maddyscientist
Member

This is the latest work towards optimizing QUDA for Blackwell:

  • Adds support for "spatial prefetching", where we over-fetch data to L2 when issuing a global load. Exposed as an optional template parameter to vector_load. At present, not deployed anywhere.
  • Adds support for prefetch instructions, in the form of both per-thread prefetching (which works on all CUDA architectures) and TMA-based prefetching, which is Hopper+ only. The prefetch type is set using the QUDA_DSLASH_PREFETCH CMake parameter, with 0=per-thread, 1=TMA bulk, and 2=TMA descriptor
  • Add an experimental L1 prefetch (using LDGSTS). Disabled, but left for future experiments.
  • Add single-threaded execution region helper function target::is_thread_zero() which should be used for TMA issuance.
  • Optionally store the backward-shifted gauge field. This simplifies all dslash indexing, as all spatial indices then correspond to "this" site. Enabled with QUDA_DSLASH_DOUBLE_STORE=ON, which is required for TMA-based prefetching (for alignment reasons).
  • Prefetching is exposed for both ColorSpinorFields and GaugeFields, though only the latter is actually used at present.
  • Added prefetching support to both Wilson and Staggered dslash kernels, parameterized using QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED CMake parameters.
  • Optimization of the neighbor indexing for the dslash kernels, reducing integer instruction overhead.
  • Reduction in pointer arithmetic overhead (use more 32-bit integer operations where possible). Added three-operand and four-operand variants of vector_load and vector_store to this end, respectively.
  • Optimization of FFMA2 issuance to reduce the total number of floating point instructions on Blackwell
  • Optimization of short <-> float conversion to reduce instruction overheads
  • Optimization of staggered packing kernels (replace division by int with division by fast_intdiv)
  • Extends OpenMP parallelization to host code where it was missing.

The end result of this work is that both the Staggered and Wilson dslash kernels can saturate over 90% of memory bandwidth for most variants. The outstanding exceptions are the half precision variants using reconstruction, which still lag; these will be the focus of a subsequent PR.

…tead of logic operations when computing the neighboring index; this is branch free and uses fewer operations
…for executing single-thread regions of code. On CUDA install the latest version of CCCL via CPM since we need some new features
…slash kernels. Disabled by default (set with the Arg::prefetch_distance parameter), and TMA prefetch will be added in the next push
…ith QUDA_DSLASH_PREFETCH_BULK=ON). Prefetch distance is now set via CMake (QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED)
…ants of vector_load and vector_store: these allow for the pointer offset and the index to be computed together first in 32-bit, before accumulation to the pointer in 64-bit, reducing pointer arithmetic overheads
…d and vector_store to reduce indexing overheads
TMA (Tensor Memory Accelerator) is only available on Hopper (sm_90+) and
later architectures. This commit wraps the cuTensorMapEncodeTiled calls
with a compile-time guard to prevent runtime errors on Volta/Ampere GPUs.
@havogt
Contributor

havogt commented Feb 6, 2026

cscs-ci run

@kostrzewa
Member

kostrzewa commented Mar 11, 2026

Is performance on AMD regularly benchmarked "officially"? If so, what is being benchmarked?

After recent updates on Lumi-G I had to update our production stack. I was not able to compile the head commit of the develop branch any more (related to what is observed in #1617 I think):

/users/bakostrz/code/quda-develop-e318708/include/targets/hip/../generic/shared_memory_cache_helper.h:127:7: error: no matching constructor for initialization of 'SharedMemory<atom_t<complex<int>[8][8][4][2][2]>, SizeDims<DimsStaticConditional<2, 1, 1>, sizeof(complex<int>[8][8][4][2][2]) / sizeof(atom_t<complex<int>[8][8][4][2][2]>)>, void>' (aka 'SharedMemory<HIP_vector_type<int, 4>, SizeDims<quda::DimsStaticConditional<2, 1, 1>, sizeof(quda::complex<int>[8][8][4][2][2]) / sizeof(atom_t<complex<int>[8][8][4][2][2]>)>, void>')
  127 |       Smem(ops, arg...), block(D::dims(target::block_dim(), arg...)), stride(block.x * block.y * block.z)

nor the (now very old) commit that we used on Lumi-G previously (6198d60):

In file included from /users/bakostrz/code/quda-develop-6198d6/lib/../include/float_vector.h:10:
/users/bakostrz/code/quda-develop-6198d6/lib/../include/complex_quda.h:425:13: error: no member named 'x' in 'complex<ValueType>'
  425 |       this->x *= z;
      |       ~~~~  ^
/users/bakostrz/code/quda-develop-6198d6/lib/../include/complex_quda.h:426:13: error: no member named 'y' in 'complex<ValueType>'
  426 |       this->y *= z; 

which I couldn't fix by trying to backport the changes to quda::complex.

We figured out that the feature/prefetch2 branch compiles, but I observe substantial performance regressions in our tmLQCD+QUDA HMC compared to our production setup which was running until December 2025:

  • about 30% in MG solves as used in our HMC
  • a factor > 2 in updateMultigridQuda
  • a factor > 2.5 in double-half mixed-precision CG (strangely not always)
  • a factor > 2 in single precision multi-shift solves with double-half refinement

Overall this leads to a factor > 2 increase in time per trajectory unfortunately.

I'm unable to pin down what is responsible as we had to update from rocm-5.6.1 (very old, I know, but that was what was available on Lumi-G at the time) to rocm-6.3.4 or rocm-6.4.4 AND make a very large jump in QUDA version.

@maddyscientist
Member Author

@kostrzewa thanks for the report on where things stand on ROCm. I think the compilation issue should be fixed with 9b83fde, which @weinbe2 has just cherry-picked into his PR that will be merged shortly, so that should get develop working again. We should also make sure that the ROCm CI captures this failure, which I guess means an update to the CI might be needed (@dmcdougall)?

Regarding the performance regression, do you happen to have a tune cache to hand for before and after? That would help guide us as to where the regression is. I suspect the issue is a compiler driven regression in the dslash performance, but it could also be changes in QUDA itself.

Since that old version of QUDA, one of the biggest changes is that the default data ordering has changed, to what I call "maximal vectorization". What this means is, for example, where we previously would have used 9x float2 SoA ordering for an 18-real-value gauge field, we now use 4x float4 + 1x float2 ordering on Hopper, or 2x float8 + 1x float2 ordering on Blackwell. The motivation here is to reduce overall instruction count (indexing and load instructions). I did incorporate the ability to use the legacy ordering, though, in case of regressions. To enable this, set -DQUDA_ORDER_DOUBLE=0 -DQUDA_ORDER_SINGLE=0 -DQUDA_ORDER_HALF=0 and recompile.
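For reference, a rebuild with the legacy ordering might look like this (the source and build paths are placeholders):

```shell
# Reconfigure QUDA with the legacy data ordering (paths are placeholders)
cmake -S /path/to/quda -B build \
  -DQUDA_ORDER_DOUBLE=0 \
  -DQUDA_ORDER_SINGLE=0 \
  -DQUDA_ORDER_HALF=0
cmake --build build -j
```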

…with shifting (can't shift a shifted field), and fix move constructor so that shift field is moved
@kostrzewa
Member

kostrzewa commented Mar 12, 2026

@maddyscientist

Regarding the performance regression, do you happen to have a tune cache to hand for before and after? That would help guide us as to where the regression is. I suspect the issue is a compiler driven regression in the dslash performance, but it could also be changes in QUDA itself.

Ah, I always forget to look at the tunecaches. Yes, please find them attached here:

quda_amd_perf_regression.tar.gz

the directory names in the archive should be reasonably self-explanatory.

Looking at some of the kernels in profile_async_0.tsv seems to confirm my observations from the tmLQCD-internal timers w.r.t. the MG as well as the ndeg twisted clover half precision kernels:

new:      744.781         7.81725         32.2379          242426       0.0030722       56x28x28x16x2        N4quda31NdegTwistedCloverPreconditionedINS_34NdegTwistedCloverPreconditionedArgIsLi3ELi4ENS_4DDNoEL21QudaReconstructType_s18EEEEE       policy,GPU-offline,kernel_arg_threshold=4096,vol=1404928,parity=1,precision=2,Ns=4,Nc=3,order=0,N=8,alt_i2f=0,TwistFlavor=2,commDim=0111,dagger,n_rhs=1,n_rhs_tile=1,topo=14414,order=01234567,p2p=0,gdr=1,nvshmem=0,pol=111111000011000000       # 878.02 Gflop/s, 391.45 GB/s, tuning took 0.235447 seconds at Thu Mar  5 17:43:55 2026

old:      277.445         5.18452          51.208          229761      0.00120754       56x28x28x16x2        N4quda31NdegTwistedCloverPreconditionedINS_34NdegTwistedCloverPreconditionedArgIsLi3ELi4EL21QudaReconstructType_s18EEEEE        policy,GPU-offline,vol=1404928,parity=1,precision=2,order=8,Ns=4,Nc=3,TwistFlavor=2,commDim=0111,dagger,topo=14414,order=01234567,p2p=0,gdr=1,nvshmem=0,pol=111111111111110000    # 2233.85 Gflop/s, 995.93 GB/s, tuning took 0.198384 seconds at Mon Oct 27 19:20:54 2025

I did incorporate the ability to use the legacy ordering though in case of regressions. To enable this, ...

I'll try this right away, thanks!


Going back to the legacy order helps a little.

The situation is subtle because on a 32c64 lattice on 2 nodes (16 GCDs), with the prefetch2 branch and legacy order I actually see a slight overall performance improvement with CrayEnv_gnu_rocm_644 over the old commit with gnu_env_23_09_rocm_561.

On 28 nodes on a 112c224 lattice instead I see, as an example:

rocm-561 / 6198d60:

MultiShiftCG: Converged after 1113 iterations
MultiShiftCG:  shift=0, 1113 iterations, relative residual: iterated = 7.732915e-07
MultiShiftCG:  shift=1, 1113 iterations, relative residual: iterated = 1.088741e-09
MultiShiftCG:  shift=2, 604 iterations, relative residual: iterated = 1.064384e-09
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 1230 iterations, L2 relative residual: iterated = 9.987934e-12, true = 9.987934e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 388 iterations, L2 relative residual: iterated = 9.899012e-12, true = 9.899012e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 232 iterations, L2 relative residual: iterated = 9.888607e-12, true = 9.888607e-12 (requested = 1.000000e-11)
# TM_QUDA: Time for invertMultiShiftQuda 2.549249e+01 s level: 5 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 2963 iter / 25.4923 secs = 304594 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 2.636584e+01 s level: 4 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift

rocm-644 / prefetch2 3c8ed1a / defaults

MultiShiftCG: Converged after 1113 iterations
MultiShiftCG:  shift=0, 1113 iterations, relative residual: iterated = 7.704380e-07
MultiShiftCG:  shift=1, 1113 iterations, relative residual: iterated = 1.084594e-09
MultiShiftCG:  shift=2, 604 iterations, relative residual: iterated = 1.061401e-09
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 1229 iterations, L2 relative residual: iterated = 9.912935e-12, true = 9.912935e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 619 iterations, L2 relative residual: iterated = 9.893005e-12, true = 9.893005e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 368 iterations, L2 relative residual: iterated = 9.933633e-12, true = 9.933633e-12 (requested = 1.000000e-11)
# TM_QUDA: Time for invertMultiShiftQuda 4.652547e+01 s level: 5 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 3329 iter / 46.5254 secs = 187391 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 4.740475e+01 s level: 4 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift

rocm-644 / prefetch2 3c8ed1a / legacy order

MultiShiftCG: Converged after 1115 iterations
MultiShiftCG:  shift=0, 1115 iterations, relative residual: iterated = 7.645503e-07
MultiShiftCG:  shift=1, 1115 iterations, relative residual: iterated = 1.075038e-09
MultiShiftCG:  shift=2, 604 iterations, relative residual: iterated = 1.065411e-09
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 1231 iterations, L2 relative residual: iterated = 9.962922e-12, true = 9.962922e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 540 iterations, L2 relative residual: iterated = 9.884733e-12, true = 9.884733e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 262 iterations, L2 relative residual: iterated = 9.888722e-12, true = 9.888722e-12 (requested = 1.000000e-11)
# TM_QUDA: Time for invertMultiShiftQuda 3.699154e+01 s level: 5 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 3148 iter / 36.9926 secs = 222860 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 3.787983e+01 s level: 4 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift

Note that these inversions have identical starting conditions. I guess it's the autotuning which causes the iteration numbers to differ a little. The main point is the time per iteration though / the reported performance.

Sorry for polluting the discussion here with so much stuff. I guess I should have opened a new issue for this...
